Automated machine learning (AutoML) can predict 90-day mortality after gastrectomy for cancer

Early postoperative mortality risk prediction is crucial for clinical management of gastric cancer. This study aims to predict 90-day mortality in gastric cancer patients undergoing gastrectomy using automated machine learning (AutoML), optimize models for preoperative prediction, and identify factors influential in prediction. The National Cancer Database was used to identify stage I–III gastric cancer patients undergoing gastrectomy between 2004 and 2016. Twenty-six features were used to train predictive models using H2O.ai AutoML. Performance on a validation cohort was measured. In 39,108 patients, the 90-day mortality rate was 8.8%. The highest performing model was an ensemble (AUC = 0.77); older age, nodal ratio, and length of inpatient stay (LOS) following surgery were most influential for prediction. Removing the latter two parameters decreased model performance (AUC 0.71). To optimize models for preoperative use, models were developed to first predict nodal ratio or LOS, and these predicted values were used as inputs for 90-day mortality prediction (AUC of 0.73–0.74). AutoML performed well in predicting 90-day mortality in a large cohort of gastric cancer patients who underwent gastrectomy. These models can be implemented preoperatively to inform prognostication and patient selection for surgery. Our study supports broader evaluation and application of AutoML to guide surgical oncologic care.


Materials and methods
The National Cancer Database (NCDB) was used to identify stage I–III gastric cancer patients undergoing gastrectomy between 2004 and 2016. The NCDB is a hospital-based cancer registry developed by the American College of Surgeons Commission on Cancer (CoC) and the American Cancer Society. Data are made available to investigators associated with a CoC-accredited cancer program. It includes data from over 1500 CoC-accredited programs and captures approximately 70% of patients with new cancer diagnoses in the United States 15. Patients with metastatic disease and those undergoing palliative-intent surgery were excluded. Twenty-six input features were selected and used to predict 90-day mortality (Table 1). Characteristics of patients alive at 90 days versus those who died were compared using two-tailed t-tests or chi-square analysis for continuous and categorical variables, respectively. IBM SPSS Version 28.0 was used for descriptive statistical analyses. This study was reviewed and approved by the Medical College of Wisconsin Institutional Review Board with waiver of informed consent (retrospective study with non-identifiable patient records) and conducted in accordance with relevant guidelines and regulations.
The H2O.ai AutoML 16 package for RStudio was used to train 20 ML algorithms that were linear, decision tree-based, or neural network-based. While linear models are best suited to characterizing linear relationships, decision trees are better suited to multi-level categorical variables (i.e. yes/no decisions), and neural networks can best handle complex variable interactions 17. Stacked ensembles, which combine the trained models, were also generated. Data were split into training and validation sets, and fivefold cross-validation was used during model training. Performance of the top ensemble and the top individual model type on the validation set was evaluated using area under the receiver operating characteristic curve (AUC), positive and negative predictive values, sensitivity, and specificity. To do so, a binary classifier based on 90-day mortality was generated, and performance was measured based on an F1 threshold optimized for specificity and negative predictive value. Shapley additive explanations (SHAP) plots, variable importance heatmaps, and partial dependence plots were generated for model interpretability. Detailed documentation and directions for implementation of H2O.ai are freely available online 16.
Conference presentation. Society of Alimentary Tract Annual Meeting 2022, San Diego, California.

Results
A total of 39,108 patients who underwent gastrectomy for gastric cancer between 2004 and 2016 were included in the study (Table 1). Of those patients, 3473 (8.8%) died within 90 days postoperatively. There was a greater proportion of males than females (67.5% vs. 32.5%) in the study, with no significant difference in sex between patients who were and were not alive at 90 days post-surgery. Compared to the patients who were alive 90 days postoperatively, those who died were older (73.5 ± 10.9 years vs. 67.04 ± 12.17 years, p < 0.001), had longer postoperative hospital length of stay (15.85 ± 13.56 days vs. 11.3 ± 10.5 days, p < 0.001), had a shorter duration between diagnosis and start of treatment (25.9 ± 30.2 days vs. 32.79 ± 32.51 days, p < 0.001), and had a greater ratio of positive nodes to nodes examined (0.323 ± 0.364 vs. 0.176 ± 0.270, p < 0.0001; Table 1). Patients who were alive at 90 days were also more likely to have received neoadjuvant radiation therapy (16% vs. 10%, p < 0.001) and chemotherapy (21% vs. 14%, p < 0.001), had smaller tumor sizes (64.26 ± 125.86 mm vs. 79.41 ± 151.99 mm, p < 0.001), and more often lived in urban areas (56% vs. 51%, p < 0.001) with a median household income ≥ $63,000 for their residential area (32% vs. 27%, p < 0.001). Other differences between patient groups are highlighted in Table 1.
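As a worked illustration of the group comparison above, the test statistic for age can be approximated from the reported summary statistics alone. The sketch below is not the original SPSS analysis: it assumes the unequal-variance (Welch) form of the t-test, and it assumes group sizes of 3473 and 35,635 (the latter inferred as 39,108 minus 3473, ignoring any missing age values). The function name `welch_t` is illustrative.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

n_total = 39108            # patients included (from the text)
n_died = 3473              # deaths within 90 days (from the text)
n_alive = n_total - n_died # assumption: no missing values in either group
mortality_rate = n_died / n_total

# Age: 73.5 ± 10.9 years (died) vs. 67.04 ± 12.17 years (alive)
t_age, df_age = welch_t(73.5, 10.9, n_died, 67.04, 12.17, n_alive)

print(f"90-day mortality: {mortality_rate:.1%}")
print(f"age: t = {t_age:.1f}, df = {df_age:.0f}")
```

Under these assumptions the statistic is large relative to its degrees of freedom, which is consistent with the reported p < 0.001 for age.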

AutoML can be feasibly used for predicting 90-day mortality.
To assess whether 90-day mortality can be predicted using H2O.ai AutoML, 10 models were fit with 31,396 patients in the training set and 7712 in the validation set. The best performing model was a stacked ensemble (fivefold cross-validation AUC 0.78; validation set AUC 0.77; Fig. 1A). The ensemble, when optimized based on F1 threshold (0.34), had a positive predictive value of 28%, negative predictive value of 94%, sensitivity of 43%, and specificity of 89% (Fig. 1B). Patient age, nodal ratio, and length of inpatient stay following surgery were the three most influential variables across models (Fig. 1C). Partial dependence plots show that greater nodal ratio and longer inpatient stay (up to 90 days) greatly influenced model prediction (Fig. 1D,E).
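The four threshold-based metrics reported above are linked through the event rate. The following minimal sketch shows how positive and negative predictive values follow from sensitivity, specificity, and prevalence. It assumes that the validation-set mortality equals the overall 8.8% rate, which the text does not state, and the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # P(death | predicted death)
    npv = true_neg / (true_neg + false_neg)  # P(alive | predicted alive)
    return ppv, npv

# Reported ensemble metrics: sensitivity 43%, specificity 89%.
# Assumption: validation-set prevalence equals the cohort-wide 8.8%.
ppv, npv = predictive_values(sensitivity=0.43, specificity=0.89, prevalence=0.088)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With a low event rate, a high negative predictive value is expected even at moderate sensitivity, while the positive predictive value remains modest; the values implied here fall within rounding distance of the reported 28% and 94%.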
Prediction of 90-day mortality preoperatively can not only inform patient prognosis but also help improve patient selection for surgery. Thus, we created models without nodal ratio or inpatient length of stay. Performance of the leading ensemble declined (fivefold cross-validation and validation set AUC 0.71; Fig. 1F). The ensemble had a positive predictive value of 21%, negative predictive value of 94%, sensitivity of 42%, and specificity of 85% (Fig. 1G). Patient age remained highly influential in model prediction, along with clinical disease burden and time from diagnosis to treatment (Fig. 1H).
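The AUC values compared above can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes AUC directly from that pairwise definition. The labels and scores are toy values invented for illustration, not study data, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 1 = died within 90 days, 0 = alive.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.62, 0.10, 0.35, 0.40, 0.05, 0.80, 0.35, 0.20]
toy_auc = auc(labels, scores)
print(f"toy AUC = {toy_auc:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the decline from 0.77 to 0.71 reflects fewer correctly ordered patient pairs once the two postoperative variables are removed.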

Inclusion of predicted length of stay partially improves AutoML model prediction of 90-day mortality.
Given the importance of length of stay to model performance, we assessed whether a two-step approach could be used: (1) prediction of patient length of stay using preoperative features (pLOS), followed by (2) prediction of 90-day mortality using input features that included pLOS (Fig. 2A). Performance of AutoML for predicting LOS is shown in Supplemental Fig. 1. Variables most important for predicting length of stay included patients' income quartile, distal or en bloc gastrectomy, and race.
The pLOS values were then added as an additional input feature for predicting 90-day mortality. Ten models were tested, and the best performing models were a stacked ensemble (fivefold cross-validation AUC 0.69 and validation set AUC 0.74; Fig. 3A) and XGBoost (fivefold cross-validation AUC 0.69 and validation set AUC 0.73; Fig. 3C). The ensemble, when optimized based on F1 threshold (0.29), had a positive predictive value of 21%, negative predictive value of 94%, sensitivity of 49%, and specificity of 82% (Fig. 3B). The XGBoost model, when optimized based on an F1 threshold of 0.28, had a positive predictive value of 23%, negative predictive value of 94%, sensitivity of 38%, and specificity of 88% (Fig. 3D). The variables that were most influential for predicting 90-day mortality in this multi-layered model included older age, longer pLOS, shorter time from diagnosis to treatment, and larger tumor size (Fig. 3E,F). The partial dependence plot for pLOS confirmed that longer pLOS greatly influenced prediction of 90-day mortality (Fig. 3G).
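The two-step approach described in this section can be sketched in a few lines. The example below is a toy stand-in, not the study pipeline: it replaces H2O.ai AutoML with a simple linear regression (step 1) and a logistic regression fit by gradient descent (step 2), and all data are synthetic. The feature names `age_z` and `total` (a hypothetical indicator for planned total gastrectomy) and every coefficient used to simulate data are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear(x, y):
    """Closed-form simple linear regression: y ~ intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_logistic(rows, labels, steps=500, lr=0.1):
    """Logistic regression by batch gradient descent; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, y in zip(rows, labels):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], r))) - y
            grad[0] += err
            for j, xj in enumerate(r):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Synthetic patients (all values invented): (age_z, total, los, died)
random.seed(0)
patients = []
for _ in range(200):
    age_z = random.gauss(0, 1)                # standardized age, known preoperatively
    total = random.choice([0, 1])             # hypothetical preoperative indicator
    los = 8 + 4 * total + random.gauss(0, 2)  # length of stay, known only after surgery
    p_death = sigmoid(-2.5 + 0.8 * age_z + 0.15 * (los - 8))
    patients.append((age_z, total, los, 1 if random.random() < p_death else 0))
train, holdout = patients[:160], patients[160:]

# Step 1: predict length of stay (pLOS) from a preoperative feature only.
intercept, slope = fit_linear([p[1] for p in train], [p[2] for p in train])

def predict_los(total):
    return intercept + slope * total

# Center pLOS on the training mean so gradient descent stays well conditioned.
mean_plos = sum(predict_los(p[1]) for p in train) / len(train)

def stage2_features(p):
    # Observed LOS (p[2]) is never used here; only its preoperative prediction is.
    return [p[0], predict_los(p[1]) - mean_plos]

# Step 2: predict 90-day mortality from preoperative features plus pLOS.
weights = fit_logistic([stage2_features(p) for p in train], [p[3] for p in train])
risk = [sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], stage2_features(p))))
        for p in holdout]
print(f"pLOS slope = {slope:.2f}; first holdout risks = {[round(r, 3) for r in risk[:3]]}")
```

One design choice worth noting: here step 1 is fit and applied on the same training rows. Generating pLOS for training patients from out-of-fold predictions is a common way to limit optimistic bias in the second step; whether the study did so is not stated in the text.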

Inclusion of predicted nodal ratio partially improves AutoML model prediction of 90-day mortality.
Given that the inclusion of pLOS only partially improved model prediction of 90-day mortality, we tested whether inclusion of predicted nodal ratio (pNodeRatio) improved performance (Fig. 2B). Performance of AutoML for predicting nodal ratio is shown in Supplemental Fig. 2. The pNodeRatio values were then used as an additional input feature for predicting 90-day mortality. Ten models were tested, and the best performing model was a stacked ensemble (fivefold cross-validation AUC 0.70 and validation set AUC 0.73; Fig. 4A). The best performing XGBoost model had an AUC of 0.68 on fivefold cross-validation and an AUC of 0.71 when tested on the validation set (Fig. 4C). The ensemble, when optimized based on F1 threshold (0.29), had a positive predictive value of 19%, negative predictive value of 95%, sensitivity of 54%, and specificity of 79% (Fig. 4B). The XGBoost model, when optimized based on F1 threshold (0.27), had a positive predictive value of 20%, negative predictive value of 94%, sensitivity of 42%, and specificity of 84% (Fig. 4D). The variables that were most influential for predicting 90-day mortality in this multi-layered model included older age, pNodeRatio, and clinical disease burden (Fig. 4E,F). The partial dependence plot for pNodeRatio confirmed that higher nodal ratio greatly influenced prediction of 90-day mortality (Fig. 4G).
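The F1 thresholds quoted above convert a continuous predicted risk into a binary classification. A minimal sketch of one way to select such a cutoff follows, assuming that "F1 threshold" refers to the probability cutoff that maximizes the F1 score. The labels and scores are toy values, and the helper names are illustrative rather than part of any library.

```python
def confusion(labels, scores, threshold):
    """Counts (tp, fp, tn, fn) when score >= threshold is classified as positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(labels, scores):
    """Try every observed score as a cutoff and keep the one with the highest F1."""
    best_f1, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp, fp, _, fn = confusion(labels, scores, t)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1

# Toy example (hypothetical): 1 = died within 90 days, 0 = alive.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.40, 0.55, 0.70]
threshold, f1 = best_f1_threshold(labels, scores)
tp, fp, tn, fn = confusion(labels, scores, threshold)
print(f"threshold = {threshold}, F1 = {f1:.3f}, tp={tp}, fp={fp}, tn={tn}, fn={fn}")
```

Because F1 ignores true negatives, the selected cutoff in a low-mortality cohort tends to sit well below 0.5, in line with the thresholds of 0.27 to 0.34 reported in this study.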

Inclusion of both predicted length of stay and nodal ratio does not further improve AutoML model prediction of 90-day mortality compared to models with either variable alone.
Given incremental improvements in 90-day mortality prediction with models that had either pLOS or pNodeRatio, we next included both predicted variables as input features (Fig. 2C). This approach did not significantly improve model performance (leading ensemble AUC of 0.73 on validation set, and leading gradient-boosting model AUC of 0.71; Fig. 5A,C). The ensemble, when optimized based on F1 threshold (0.29), had a positive predictive value of 24%, negative predictive value of 91%, sensitivity of 39%, and specificity of 83% (Fig. 5B). The gradient-boosting model, when optimized based on F1 threshold (0.28), had a positive predictive value of 22%, negative predictive value of 94%, sensitivity of 38%, and specificity of 87% (Fig. 5D). Both increased pNodeRatio and higher pLOS were influential in predicting 90-day mortality (Fig. 5E-H). Finally, we completed sensitivity analyses stratified by facility type and surgical approach, given the heterogeneity in 90-day mortality based on these factors (Supplemental Fig. 3). Model performance was maintained across facility type and surgical approach.

Discussion
The major findings of this study are: (1) in a cohort of stage I-III gastric cancer patients who underwent gastrectomy, AutoML performed well in predicting early postoperative mortality; (2) the generated AutoML models produced predictions that could help with clinical patient prognostication and counseling of those predicted to be high risk; (3) the variables most influential in predicting 90-day mortality include older age, high nodal ratio of positive nodes to nodes examined, and prolonged hospital length of stay following surgery; (4) a multi-step approach that first predicts a postoperative characteristic (i.e. pLOS and pNodeRatio) and then 90-day mortality can be used to design models for preoperative use. Our work shows that AutoML can be feasibly, efficiently, and easily used for training and validating ML models using commonly collected perioperative factors. To our knowledge, our study is the first to demonstrate the applicability of AutoML for early postoperative mortality prediction in cancer surgery. Thus, in addition to its potential utility for surgical treatment of patients with gastric cancer, our study supports broader evaluation and application of AutoML to guide surgical oncologic care.
Numerous studies have highlighted the importance of predicting mortality among patients with advanced cancers to assist with appropriate treatment planning and patient counseling 3,4,18. Post-gastrectomy outcomes and mortality have been associated with several factors including stage of the disease, lymph node metastases, co-morbidities from neoadjuvant therapy, and age of the patient 2,19-21, but few clinical support tools or algorithms have been developed to accurately inform patient prognostication based on perioperative variables. Niu et al.'s review on the application of artificial intelligence within gastric cancer highlights several studies that used ML models to diagnose gastric cancer and predict recurrence and metastasis; however, most of these studies utilized endoscopy or computed tomography images, pathology slices, or genetic features 7. Image-based prediction models require large quantities of accurately annotated data 7,22, and acquiring genetic features for all patients adds to the cost of patient care and requires substantial time. One of the most widely used surgical risk calculators was developed by the American College of Surgeons National Surgical Quality Improvement Program (ACS NSQIP). The ACS NSQIP risk calculator has previously been studied for the purpose of predicting mortality following gastrectomy. In comparison to our reported models, the ACS NSQIP risk calculator shows similar, and sometimes worse, performance in predicting mortality in this population 23. An advantage of our approach is the inclusion of cancer-specific variables including staging, receipt of preoperative oncologic therapies, and tumor characteristics. Furthermore, Lu et al.'s systematic review of 15 articles that utilized ML models to predict early mortality in patients with cancer using electronic health record data showed that model performance ranged from AUCs of 0.71 to 0.92 24.
Unlike those studies, we utilized common data elements found within readily available real-world data sources to train our ML models in patients with gastric cancer who underwent non-palliative gastrectomy. While many prior studies of ML models rely on small sample sizes, our study with 39,108 patients highlights the promising ability of AutoML models to predict early mortality among cancer patients using data from population-level registries. Our approach provides a template for developing cost-effective and easy-to-implement decision-support tools for guiding patient selection for surgical treatment in this population. Our use of an interpretable machine learning approach facilitates the identification of potentially targetable risk factors. Older patient age, higher nodal ratio, and greater number of days between surgery and discharge were the three most influential variables across models in predicting 90-day mortality. This is consistent with Shannon et al.'s multivariate retrospective analysis of patients within the NCDB with stage I-III gastric adenocarcinoma who underwent total gastrectomy; their results showed that increasing age and a lower number of lymph nodes examined are associated with 90-day mortality 2. Shu et al. further showed that older age (> 70 years) was associated with an increased rate of complications (20% vs. 11% in those < 70 years) and higher 90-day mortality (3.7% vs. 0.5%) in a cohort of 534 patients at a single institution. Notably, age independently predicted mortality after controlling for tumor biology, cancer stage, adjuvant therapy, and postoperative complications 25, thereby highlighting the need for careful evaluation and counseling of older patients prior to gastrectomy.
To ensure clinical utility, the timing of implementing predictive models is crucial. The initial model in this study can inform postoperative patient prognostication and highlighted the importance of postoperative length of stay and nodal ratio in predicting 90-day mortality. This is consistent with previous efforts to enhance prognostication in gastric cancer, which reported that the number of nodes examined and nodal positivity independently influence survival 26,27. However, preoperative prediction is necessary to assist with both patient prognostication and selection for surgery. To ensure that our predictive models are useful in the preoperative setting, we used a multi-step modeling strategy in which we first predicted length of stay and nodal ratio using only parameters available preoperatively. These predicted features were then used as input features in our final model for predicting mortality, which showed high discriminatory capability. This complex strategy was easy to implement through H2O.ai's AutoML tools.
Despite better performance in prediction of pNodeRatio compared to pLOS, inclusion of pLOS provided the most improvement in model performance in predicting 90-day mortality. This suggests that patients who are at higher risk for longer hospital stays are highly susceptible to early postoperative mortality. Our work highlighted that patients' income quartile, undergoing distal or en bloc gastrectomy, and racial background influenced length of stay predictions. This is in line with prior studies showing that the extent of resection and type of surgical procedure are independently predictive of postoperative length of stay in patients with gastric cancer 28. In addition to these factors, patients' preoperative physical function/strength and co-morbidities influence both postoperative complications and length of hospital stay 29,30. Future models that incorporate these preoperative characteristics may enhance pLOS prediction and subsequent early mortality prediction. Importantly, the congruence between prior research and the variables that were most influential in AutoML models provides confidence in these models' clinical utility.
The influence of hospital length of stay on predicting early mortality also provides an opportunity for implementing clinical programs that help reduce this duration, to then potentially reduce early postoperative mortality. Enhanced Recovery After Surgery (ERAS) protocols have been implemented following gastrectomy 31,32, and they incorporate preoperative counseling and nutrition, earlier mobilization and feeding following surgery, avoidance of abdominal drains, and nasogastric/nasojejunal decompression 33,34. Wee et al.'s meta-analysis comparing conventional postoperative care versus ERAS protocols showed that ERAS programs decreased length of stay and care costs but did not significantly alter 30-day postoperative mortality or postoperative morbidity 33. Weindelmayer's single-institution study of 351 gastric cancer patients reported a reduction in 90-day mortality among patients in the ERAS program (0.8% vs. 4.8% control); however, their overall 90-day mortality was only 2% 35. Further research is necessary to optimize ERAS programs and to assess whether they reduce early postoperative mortality. Within our dataset, there was a cohort of patients who were still admitted to the hospital past 90 days postoperatively, and while the primary aim of this study was to assess early mortality, further research is necessary to understand predictors of prolonged hospital stays as well as morbidity, mortality, and quality of life outcomes among these populations.
Numerous studies have piloted clinical implementation of machine learning tools. Avati et al. developed a deep neural network that screens electronic health records from all admitted patients at Stanford Hospital and predicts all-cause mortality within 3-12 months. They implemented the ML algorithm as a screening tool that notifies palliative care of positive predictions 36, thereby streamlining patient referrals and demonstrating how ML-based early mortality predictions can improve the efficiency of patient care. Manz et al. developed an ML algorithm to predict 180-day mortality among oncology clinic patients within a health system in Pennsylvania. Their randomized clinical trial implementing this model along with behavioral nudges (weekly performance feedback to clinicians) showed increased rates of serious illness conversations with high mortality risk patients, a positive clinician behavior that improves end-of-life care 22. Our results provide the necessary first step towards bedside application by demonstrating the feasibility of using AutoML to produce robust mortality predictions. Specifically, AutoML-based predictions could be used to augment perioperative risk stratification and postoperative treatment planning. Models can be implemented through direct integration with electronic health records as well as through development of websites/applications (as done with NSQIP risk calculators) for bedside use. Future work will focus on developing these strategies for implementation of the model developed in this study.
A crucial point to emphasize is that we do not advocate for strictly following the output of our model-derived prediction to make clinical decisions. Frequently, clinicians, patients, and caregivers are faced with difficult conversations to decide on the optimal treatment trajectory to pursue. Estimating surgical risk using clinical factors and surgeon judgment is frequently part of these discussions, even in the absence of ML tools. AutoML can be useful in these circumstances by augmenting clinical judgment and providing another parameter to help inform shared decisions.
Our results must be interpreted while considering the limitations. While NCDB allows us to train ML models on a large cohort of heterogeneous patients, the database itself is limited by missing data 37, lack of information on the cause of death, and biases introduced by retrospective analysis 2. Additionally, the database does not include information on patient transfers to hospice care, so we cannot discern what proportion of patients died in hospice. While our results were consistent with prior work that utilized NCDB 2, the nearly 9% 90-day mortality seen in our patient cohort is higher compared to other series 38,39. However, those studies had smaller sample sizes and less heterogeneity in treatment centers. Given our goal of making this model broadly applicable, we elected to include all types of gastric resection and all types of center where surgery was performed. As expected, 90-day mortality varied greatly across resection and center types. Our sensitivity analysis showed similar model performance in low mortality centers (academic) and resection types (total/distal gastrectomy). Nonetheless, prior to clinical implementation, models should be validated and optimized based on institutional data; this process is simplified given the easy-to-use nature of the AutoML platform. Finally, while NCDB captures approximately 70% of cancer patients, it only has data from patients who were treated at CoC-accredited facilities, and thus is not generalizable to the entire US population 2,37. Nonetheless, AutoML is able to handle missing data and reasonably predict early mortality in this heterogeneous population using only the available features. Our work focused on only one AutoML approach, and further studies are necessary to understand the applicability of other models within surgical risk prediction. Lastly, while we focused on mortality prediction, it is not the only outcome of interest for patients and families considering gastric surgery. Thus, future studies focused on morbidity and quality of life predictions are needed.

Conclusion
While surgery is the only curative therapy for patients with gastric cancer, 90-day postoperative mortality remains high, and prediction of early postoperative mortality is challenging. While machine learning algorithms have the potential to make predictions more accurately when compared to traditional statistical methodologies, the need for computational and statistical expertise has been suggested to limit the widespread application of machine learning within healthcare. In a large cohort of gastric cancer patients that underwent gastrectomy, our study shows that AutoML performs well in predicting mortality. Models can further be optimized for preoperative prediction, thereby not only allowing for robust patient prognostication but also informing patient selection for surgery. We were also able to identify key perioperative variables that were influential in outcome prediction to guide future interventions that mitigate risks of early postoperative mortality. Our work provides a framework for effective, scalable, easy-to-implement, and explainable machine learning to inform clinical decision-making.

Data availability
The datasets generated and/or analyzed during the current study are available in the National Cancer Database Participant Use Files for eligible users: https://www.facs.org/quality-programs/cancer-programs/national-cancer-database/puf.